ABSTRACT

Aims. The Gaia astrometric survey mission will, as a consequence of its scanning law, obtain low resolution optical (330-1000 nm) spectrophotometry of several million unresolved galaxies brighter than V=22. We present the first steps in a project to design and implement a classification system for these data. The goal is both to determine morphological classes and to estimate intrinsic astrophysical parameters via synthetic templates. Here we describe (1) a new library of synthetic galaxy spectra, and (2) first results of classification and parametrization experiments using simulated Gaia spectrophotometry of this library.

Methods. We have created a large grid of synthetic galaxy spectra using the PEGASE.2 code, which is based on galaxy evolution models that take into account metallicity evolution, extinction correction, emission lines (with stellar spectra based on the BaSeL library). Our classification and regression models are Support Vector Machines (SVMs), which are kernel-based nonlinear estimators.

Results. We produce a basic library of about 3600 zero redshift galaxy spectra covering the main Hubble types over the wavelength range 250 to 1050 nm at a sampling of 1 nm or less. It is computed on a regular grid of four key astrophysical parameters for each type and for intermediate random values of the same parameters. An extended library reproduces this at a series of redshifts. Initial results from the SVM classifiers and parametrizers are promising, indicating that Hubble types can be reliably predicted and several parameters estimated with low bias and variance. Comparing the colours of our synthetic library with Sloan Digital Sky Survey (SDSS) spectra we find good agreement over the full range of Hubble types and parameters.

Key words. Galaxies: fundamental parameters - Techniques: photometric - Techniques: spectroscopic
1. Introduction

Large surveys of galaxies provide information on their global spatial distribution and the physical properties of individual galaxies. Such a survey will be obtained for the whole sky by the ESA mission, Gaia, from 2011-2016. During its five year mission Gaia will observe several million unresolved galaxies over the whole sky. Although the survey's main goal is the stellar content and the structure of our galaxy, there remains a lot of important science to be extracted from the galactic component.

There currently exist several surveys of galaxies, but even SDSS, one of the most extended galaxy photometric and spectroscopic surveys in the optical and near IR (at about the spectral range of Gaia), covers only a fifth of the sky. Gaia extends this in several ways: i) It will be able to detect about 10^7 unresolved galaxies down to G=20 (V=20-22); ii) Gaia will be the first homogeneous survey of galaxies covering the whole sky since the photographic ones (UK, ESO, Palomar Schmidt surveys, 3500 to 6500 Å) of 30 years ago; iii) The spectrophotometry covers a larger spectral range (3300 to 10 000 Å sampled in about 100 bins) than earlier surveys; iv) Gaia observes each source an average of 80 times over the mission. With this we can investigate many different types of galaxy, QSO and AGN variability; v) The sample will have a well-defined selection function, important for estimating the galaxy density in the local universe.

Our long-term objective is to classify and to determine the astrophysical parameters of all unresolved galaxies which Gaia will observe. In order to proceed with this we first need to acquire an appropriate library of galaxy spectra. This library must show sufficient variation in those intrinsic astrophysical parameters (APs) to which the Gaia observations will be sensitive. To determine APs on a homogeneous system we ultimately need to build or calibrate our classifiers using synthesis models and synthetic spectra. Existing observed or synthetic libraries [...] wavelength range. For this reason we set out in this paper to start building a new library.
(Fioc & Rocca-Volmerange 1997, 1999) to synthesize galaxy spectra. The PEGASE.2 code^1 is aimed principally at modelling the spectral evolution of galaxies by types: the active and passive evolution of stellar populations as well as interstellar gas and dust are coherently evolved in time. No galaxy number density evolution is considered, although the results of our models are compatible with occasional rare galaxy merging. The code is based on the stellar evolutionary tracks from the Padova group, extended to the thermally pulsating asymptotic giant branch (AGB) and post-AGB phases (Groenewegen & de Jong 1993). These tracks cover all the masses, metallicities and phases of interest for galaxy spectral synthesis. PEGASE.2 uses the BaSeL 2.2 library of stellar spectra and can synthesize low resolution (R=200) ultraviolet to near-infrared spectra of Hubble sequence galaxies, as well as of starbursts. For a given evolutionary scenario (typically characterized by a star formation law, an initial mass function and, possibly, infall or galactic winds), the code consistently gives the spectral energy distribution (SED) and computes the star formation rate and the metallicity at any time. The nebular component (continuum and lines) due to H II regions is calculated and added to the stellar component. Depending on the geometry of the galaxy (disk or spheroidal), the attenuation of the spectrum by dust is then computed using a radiative transfer code (which takes account of the scattering).

By accepting a star formation rate proportional to the mass of the gas, the IMF of Rana & Basu (1992) and the presence of infall and galactic winds, eight synthetic spectra corresponding to different typical types of Hubble sequence galaxies (E, S0, Sa, Sb, Sbc, Sc, Sd and Im) have already been produced using PEGASE.2 (Fioc 1997; Fioc & Rocca-Volmerange 1999; Le Borgne & Rocca-Volmerange 2002). For each type, the values of the parameter set have been fitted to the observed spectral energy distribution (SED) of nearby (z=0) galaxies. For illustration a comparison with data is shown in Fioc (1999). At higher redshifts, the evolution scenarios have been tested against most existing faint galaxy samples, including the deepest surveys (B=29 Hubble Deep Field-N, Williams et al. 1996). One unique model of galaxy fractions by type simultaneously predicts the multi-wavelength (UV to near-IR) galaxy counts, dominated by young stellar populations in the UV and old evolved galaxies in the near-IR respectively. The faint blue galaxy population, in excess in the far-UV, has also been analysed (Fioc & Rocca-Volmerange 1999b). An episodic star formation rate of low level is proposed to fit the far-UV counts (Armand & Milliard 1994; Buat et al. 1999). In the near-IR, the evolution scenario of elliptical galaxies predicts the puzzling K-z relation of radio galaxy hosts between z=0 and z=4. Rocca-Volmerange et al. (2004) use PEGASE.2 scenarios to interpret the galaxy distribution in the K-band Hubble diagram. The same models are used to interpret the mid-IR galaxy counts (Rocca-Volmerange et al. 2007), although here a supplementary ultra-luminous infrared galaxy population is required. Finally, the robustness of our evolution scenarios is confirmed by the significant predictions of photometric redshifts as compared to spectroscopic redshifts of the HDF-N sample (Le Borgne & Rocca-Volmerange 2002). Using a much larger sample from the SDSS, we make an additional comparison. This is the subject of the second section of this paper, made using simulated photometry and colour-colour diagrams. In section 3 we describe the production of our library based on these eight typical synthetic spectra of galaxies and in section 4 we explain how these are used to simulate Gaia data. In section 5 we present our classification and parametrization models and give preliminary results on their performance. A brief discussion follows in section 6.

^1 http://www2.iap.fr/users/fioc/PEGASE.html

2. PEGASE synthetic spectra and comparison with 
the SDSS spectra 

In order to determine the parameter ranges over which we should generate the library, we first make a comparison of colours synthesized from the eight typical PEGASE spectra with SDSS data. To avoid small discrepancies that occur between synthesized and published SDSS photometry^2 and to treat both types of spectral data in the same way, we decided to synthesize SDSS photometry from the SDSS spectra in the same way as we do with the synthetic spectra (and using the same "calib" and "colors" programs in the PEGASE.2 code for both). For this we use the whole set of spectroscopic data for the 565 715 galaxies that are available in data release 4 (DR4) of SDSS. The properties of the SDSS filters are given in Table 1.

Typical synthetic spectra corresponding to each of the eight Hubble types are shown in Fig. 1 with the location of the SDSS filters superimposed. Each of these "typical spectra" corresponds to a specific combination of values of the astrophysical parameters (see section 3.1). The SEDs produced by PEGASE have been normalized to the flux of a 50 Å wavelength interval centered on 5500 Å. The elliptical and S0 galaxies have very small differences, apparent at the two extremes of the wavelength range. This implies small differences in colours but not necessarily in magnitudes (which depend on their masses).

From Fig. 1 it is obvious that the u filter is very important for the comparison with real data since it is the one containing the discontinuity around 4000 Å. However, the SDSS spectra do not cover the u band, so photometry in this band cannot be synthesized. We refrain from using the SDSS photometry for the u band because of the red leak in this filter^3, which would render comparisons with synthetic data unreliable. This leak produces erroneous magnitudes, especially for E and S0 types on account of their large numbers of red stars.

In addition we avoid using the z filter in our comparison 
since its photometry also cannot be synthesized from the SDSS 
spectra, which terminate at shorter wavelengths than the z pass- 
band. 

We therefore decided to base our comparison between the SDSS and PEGASE.2 data on the g, r, i filters only and, more specifically, the g-r and r-i colours. However, the wavelength range of the SDSS spectra does not quite extend to the bluest side of the g filter. For this reason, we cut the blue end of this filter and created a new g filter starting at 3830 Å instead of 3630 Å (Table 1). However, this change is in practice very small since the transmission of the g filter is only 3% of the peak transmission at 3830 Å and drops very rapidly below that (e.g. it is only 0.5% at just 10 Å lower). Furthermore, simulated photometry from the synthetic spectra showed virtually no difference for the original and "trimmed" g band. The published transmission curves of the SDSS filters depend on airmass and whether a point or extended source is being observed. We use those for extended sources and zero airmass. The photometry is calibrated on the AB system, as used by SDSS (Fukugita et al. 1996).

Fig. 1. Synthetic spectra for the eight typical galaxy types from PEGASE.2. The vertical lines denote the limits of the five SDSS filters (transmission below 1e-4 of the peak). (Emission lines are not included.) The legend at the right defines the colour used to plot each type of galaxy (top) and SDSS filter (bottom).

^2 http://www.sdss.org/dr4/products/spectra/spectrophotometry.html
^3 http://www.sdss.org/dr4/products/images/index.html#redleak

We synthesize photometry using the one-dimensional spectra from DR4, which are supplied with additional analysis information, such as redshift and emission line parameters. In order to select data suitable for our purposes, we applied the following criteria: the galaxies should not be near a CCD edge nor saturated, and they should not have very low SNR (the photometric error in all bands should be less than 0.1 mag). Only spectra with redshifts below 0.01 are retained, since the synthetic spectra of PEGASE.2 were produced at zero redshift. These criteria resulted in a sample of 1292 galaxies. Their synthesized photometry plus that for the eight typical galaxy types from PEGASE.2 is shown in Fig. 2. This figure clearly shows that the colours of the Im, Sd, Sc, Sbc, Sb and Sa types are generally in good agreement with the colours of the observed spectra, although in the case of S0 and E types the synthetic spectra seem to be slightly redder in g-r than the SDSS spectra.
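As an illustration of how a colour such as g-r can be synthesized from a spectrum on the AB system, the following is a minimal sketch only. It is not the "calib" and "colors" programs of PEGASE.2 used in this work; the function names, the rectangular passbands and their limits are assumptions made for the example, and a real application would use the published SDSS response curves.

```python
import math

C_ANGSTROM_PER_S = 2.99792458e18  # speed of light in Angstrom per second


def trapezoid(x, y):
    """Integrate y(x) with the trapezoidal rule."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i])
               for i in range(len(x) - 1))


def ab_magnitude(wavelength, f_lambda, transmission):
    """AB magnitude of a spectrum through one passband (photon-counting form).

    wavelength   : Angstrom, ascending
    f_lambda     : erg / s / cm^2 / Angstrom
    transmission : dimensionless filter response on the same grid
    """
    numerator = trapezoid(
        wavelength,
        [f * t * w for f, t, w in zip(f_lambda, transmission, wavelength)])
    denominator = trapezoid(
        wavelength,
        [C_ANGSTROM_PER_S * t / w for t, w in zip(transmission, wavelength)])
    return -2.5 * math.log10(numerator / denominator) - 48.60


def top_hat(wavelength, lower, upper):
    """Hypothetical rectangular passband, NOT a real SDSS response curve."""
    return [1.0 if lower <= w <= upper else 0.0 for w in wavelength]


# Test spectrum: constant f_nu, which has zero colour on the AB system.
wavelength = [3000.0 + 10.0 * i for i in range(701)]   # 3000-10000 Angstrom
f_nu = 3.631e-20                                       # erg / s / cm^2 / Hz
flat_spectrum = [f_nu * C_ANGSTROM_PER_S / w ** 2 for w in wavelength]

g_band = top_hat(wavelength, 3830.0, 5400.0)   # illustrative limits only
r_band = top_hat(wavelength, 5600.0, 6800.0)   # illustrative limits only
g_mag = ab_magnitude(wavelength, flat_spectrum, g_band)
g_minus_r = g_mag - ab_magnitude(wavelength, flat_spectrum, r_band)
```

Applying the same function to both the observed and the synthetic spectra mirrors the choice made above of treating both kinds of data identically.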



3. The library of synthetic spectra 

3.1. The most significant parameters

Each spectrum in our library is uniquely defined by a set of 17 astrophysical parameters, plus the morphological type (E, S0, Sa, Sb, Sbc, Sc, Sd or Im). The four most significant APs are: p1 and p2 of the star formation scenario ($M_{\mathrm{gas}}^{p_1}/p_2$); the infall timescale; and the age of the galactic winds. The age of the galactic winds is non-zero only for E and S0 galaxies. Note that the Hubble type is not an independent parameter, as only certain ranges of the APs are available for each type (as will be detailed later).
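To give a feeling for the roles of p1 and p2, the sketch below integrates a star formation law of the form SFR = M_gas^p1 / p2 for a closed box. This is an assumption-laden toy: it ignores infall, galactic winds and the gas returned by stars, all of which PEGASE.2 handles, and the function name and the forward Euler scheme are choices made only for this example.

```python
import math


def gas_fraction(p1, p2_myr, age_myr, step_myr=1.0):
    """Gas mass left after age_myr in a closed box of unit initial mass.

    Assumes dM_gas/dt = -SFR with SFR = M_gas**p1 / p2 (forward Euler).
    Infall, winds and stellar mass return are deliberately ignored.
    """
    m_gas = 1.0
    n_steps = int(round(age_myr / step_myr))
    for _ in range(n_steps):
        sfr = m_gas ** p1 / p2_myr
        m_gas = max(m_gas - sfr * step_myr, 0.0)
    return m_gas


# p1 = 1 and p2 = 5714 Myr/Msol are the typical Sbc values quoted in the text;
# 13000 Myr is the model age used for that type.
sbc_like = gas_fraction(p1=1.0, p2_myr=5714.0, age_myr=13000.0)
```

For p1 = 1 the law reduces to exponential gas consumption with timescale p2, which is a convenient check of the integration.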



Fig. 2. Colour-colour (g-r vs. r-i) diagram of synthesized pho- 
tometry of SDSS galaxy spectra (black) and synthetic photome- 
try of the eight typical galaxy types generated from the PEGASE 
models (red points). 
Fig. 3. Colour-colour (g-r vs. r-i) diagram of synthesized photometry of SDSS galaxy spectra (black) and of synthetic PEGASE spectra of the typical Sbc model (yellow) and the models of Sbc with different values of p1 (red). The largest g-r corresponds to p1=1 and the smallest g-r to p1=2.



In order to investigate the influence of each of the parameters p1, p2 and infall timescale on the integrated galaxy spectrum (SED), we modified the parameters of the Sbc model (an intermediate type) over a range of values. In the typical model for the Sbc type the values were 1, 5714 Myr/Msol and 6000 Myr for p1, p2 and infall timescale, respectively. In the modified models we vary p1 between 0.4 and 2, p2 from 100 to 20000 Myr/Msol and infall from 100 to 10000 Myr. The results are shown in Figs. 3-5.

To investigate the effect of the age of the galactic winds parameter we followed the same procedure but now with the elliptical model. In the typical model for the E type the age is 1 Gyr and we vary it between 0.1 and 7.5 Gyr (Fig. 6).

From the figures we see that these four parameters have a 
major effect on the colours. We performed similar analyses for 
other APs and concluded that they had a much smaller impact on 
the data (in particular once the spectra are reduced to the Gaia 
resolution). Therefore, the spectra in the present library show 
variance only in these four APs. 
Fig. 4. Colour-colour (g-r vs. r-i) diagram of synthesized photometry of SDSS galaxy spectra (black) and of synthetic PEGASE spectra of the typical Sbc model (yellow) and the models of Sbc with different values of p2 (red). The largest g-r corresponds to p2=2000 Myr/Msol and the smallest g-r to p2=20000 Myr/Msol.

Fig. 5. Colour-colour (g-r vs. r-i) diagram of synthesized photometry of SDSS galaxy spectra (black) and of synthetic PEGASE spectra of the typical Sbc model (yellow) and the models of Sbc with different values of infall timescale (red). The largest g-r corresponds to infall timescale=100 Myr and the smallest g-r to infall timescale=10 Gyr.

Fig. 6. Colour-colour (g-r vs. r-i) diagram of synthesized photometry of SDSS galaxy spectra (black) and of synthetic PEGASE spectra of the typical E model (yellow) and the models of E with different values of age of galactic winds (red). The largest g-r corresponds to age of galactic winds=7.5 Gyr and the smallest g-r to 0.1 Gyr.



By co-varying these four parameters and using all their combinations in each of the eight typical models we are able to cover most of the variance we see in the SDSS data in the colour-colour diagram. Generally, there is no clear distinction between the colours of neighbouring Hubble types. In order to know the types in our library we decided (as a first working approximation) to retain only those models which lie within a circle (in the colour-colour diagram) centered on one of the eight typical types and with a radius equal to half of the distance to the nearest neighbouring typical model. This is reasonable since the models lie mostly on a one-dimensional surface (line) in the colour-colour diagram. In this way upper and lower limits of the values of the parameters were established for each type, although in this case an overlap in APs (if not in colours) remains, as can be seen in Table 2. This leaves a set of 888 synthetic spectra of known types of galaxies (see section 3.2).

The galaxy type can be considered as a 5th AP, although it is of a different nature than the others, since it is needed to fully specify the spectrum and constrain the range of values of the other four APs. In addition, when one redshifts the spectrum to non-zero values of z, this quantity also becomes a parameter (albeit not intrinsic to the source).

3.2. Library of galaxy spectra over a regular grid of parameters

Applying the above procedures, we produced a library of 888 synthetic spectra covering seven separate Hubble types (because we consider E and S0 as a single type). The values of the four parameters of each type are given in Table 2, while the values of the other input parameters of PEGASE.2 (kept constant in all models) are given in Table 3. The models are plotted in Fig. 7, where the simulated colours of the 888 synthetic spectra and the 1292 SDSS spectra are compared. This first set of 888 synthetic spectra was then calculated at five values of redshift: 0, 0.05, 0.1, 0.15, 0.2, resulting in a total of 4440 spectra.
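The circle criterion used to assign a type to a model can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical and the colour coordinates of the typical models are invented placeholders, not values from the library.

```python
import math


def nearest_neighbour_radius(name, typical):
    """Half the distance from one typical model to its nearest neighbour."""
    centre = typical[name]
    nearest = min(math.dist(centre, other)
                  for other_name, other in typical.items()
                  if other_name != name)
    return 0.5 * nearest


def assign_type(colour, typical):
    """Return the type whose circle contains `colour`, or None if no circle does.

    `colour` and the values of `typical` are (r-i, g-r) pairs. With radii of
    half the nearest-neighbour distance the circles cannot overlap.
    """
    for name, centre in typical.items():
        if math.dist(colour, centre) <= nearest_neighbour_radius(name, typical):
            return name
    return None


# Placeholder colours for three typical models (illustrative numbers only).
typical_models = {"Im": (0.2, 0.1), "Sc": (0.4, 0.2), "E": (0.8, 0.4)}

kept = assign_type((0.21, 0.10), typical_models)      # inside the Im circle
rejected = assign_type((0.55, 0.30), typical_models)  # between circles
```

Models returning None are the ones discarded from the library, which is why the retained spectra do not overlap between types in this colour plane.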



3.3. Extension of the library to random values of parameters 

After producing the regular synthetic spectral grid (Table 2), we proceed to produce synthetic spectra of galaxies with parameters selected from a random distribution, in order to achieve a more continuous coverage in colour space. Such grids permit more robust tests of parameter estimation algorithms than do regular grids. Each parameter is selected independently from a uniform distribution over the parameter ranges in the regular grid. We used this approach to generate 5500 models. In doing this we keep approximately the same ratios between the Hubble types as in the regular grid. Because the parameter ranges for each galaxy type in Table 2 show some overlap, a random draw may produce a set of parameters which fits into more than one Hubble type category. To remove this "degeneracy" we again apply the circle removal method we used in section 3.1. This results in a "non-degenerate" sample of 2709 spectra.
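The independent uniform draws described above can be sketched as below. The Sbc ranges are copied from Table 2; the function name, the dictionary layout, the seed and the number of draws are assumptions made for the example.

```python
import random

# Sbc ranges from Table 2: p1, p2 (Myr/Msol) and infall timescale (Myr).
SBC_RANGES = {
    "p1": (0.4, 1.5),
    "p2": (2000.0, 10000.0),
    "infall": (4000.0, 7000.0),
}


def draw_parameters(ranges, rng):
    """Draw each parameter independently from a uniform distribution."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


rng = random.Random(42)  # fixed seed so the example is reproducible
random_models = [draw_parameters(SBC_RANGES, rng) for _ in range(100)]
```

Each drawn parameter set would then be passed to the spectral synthesis code and filtered with the circle criterion to remove type degeneracy.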
Table 2. The four astrophysical parameter (AP) ranges for each Hubble type in the regular library of PEGASE synthetic spectra. Note that the AP ranges for each Hubble type partially overlap. The morphological type can be considered as an additional (but non-independent) parameter, required to fully explain the variance in the library. The final column (N) gives the number of spectra for each type (which sum to 888). See the regular library grid in Le Borgne & Rocca-Volmerange (2002) for comparison.

Type   p1        p2            infall       galactic winds   N
                 (Myr/Msol)    (Myr)        (Gyr)
E-S0   0.6-1.5   100-1500      100-2500     0.1-7.5          327
Sa     0.8-1.5   500-2500      2500-3500    none             10
Sb     0.6-1.5   1500-6000     2000-4500    none             25
Sbc    0.4-1.5   2000-10000    4000-7000    none             148
Sc     0.6-1.5   6000-14000    7000-10000   none             68
Sd     0.4-1.5   10000-18000   7000-10000   none             65
Im     1.0-2.0   14000-20000   7000-10000   none             245

Table 3. The values of the parameters of the PEGASE models which are kept constant in the library (Fioc & Rocca-Volmerange 1997).

Parameters                                        Values
SNII ejecta of massive stars                      model B of Woosley & Weaver (1995)
Stellar winds                                     yes
Initial mass function                             Rana & Basu (1992)
Lower mass                                        0.09 solar masses
Upper mass                                        120.00 solar masses
Fraction of close binary systems                  0.05
Initial metallicity                               0.00
Metallicity of the infalling gas                  0.00
Consistent evolution of the stellar metallicity   yes
Mass fraction of substellar objects               0.00
Nebular emission                                  yes
Extinction                                        disk geometry: inclination-averaged for Sa, Sb, Sbc, Sc, Sd and Im
                                                  spheroidal geometry for E-S0
Age                                               13 Gyr for E-S0, Sa, Sb, Sbc, Sc & Sd
                                                  9 Gyr for Im



A comparison of the simulated colours of the synthetic spectra (888 regular grid plus 2709 random grid, at zero redshift) with the colours of SDSS spectra is shown in Fig. 7. One sees that the new set of spectra is in very good agreement with the SDSS data, except for the small differences in the E and S0 galaxies.

Fig. 7. Colour-colour (g-r vs. r-i) diagram of synthesized photometry of SDSS galaxy spectra (black) and of synthetic PEGASE spectra of the 8 typical models of PEGASE.2 (yellow). Moving from the lower left to the upper right part of the diagram we encounter types from Im to E. The red dots along both sides of the typical models represent the spectra of both the regular and random library.

In summary, we have produced a library of 7149 synthetic galaxy spectra (888 spectra of the regular grid for 5 values of redshift and 2709 of the random grid at zero redshift) which can be used as an initial library of unresolved galaxy spectra for assessing the possibilities of galaxy classification and parametrization with Gaia. This library was created at the resolution of the BaSeL 2.2 stellar library (gradually changing from 8 Å at 2500 Å to 50 Å at 10 500 Å), which is not quite high enough for the Gaia simulation software (which requires 10 Å). Therefore, we linearly interpolated our spectra in order to resample them to 10 Å over the wavelength range of 2500-10500 Å. Higher resolution spectra will be produced in future work using the high spectral resolution code PEGASE-HR (Le Borgne et al. 2004).

4. Simulated Gaia spectra

Fig. 8. The simulated BP and RP spectra of the synthetic spectra for the eight typical galaxy types from PEGASE.2. Black, green, blue, yellow, magenta, light blue and red denote galaxies of type E, Sa, Sb, Sbc, Sc, Sd and Im respectively.

Fig. 9. The 9691 simulated Gaia galaxy spectra with z=0 plotted as their projections onto the first three Principal Components. Black, green, blue, light blue, magenta, yellow and red denote galaxies of type E, Sa, Sb, Sbc, Sc, Sd and Im respectively.

The Gaia spectrophotometer is a slitless prism spectrograph comprising blue and red channels (called BP and RP respectively) which operate over the wavelength ranges 3300-6800 Å and 6400-10500 Å respectively. BP and RP spectra were simulated for all 7149 library spectra using the simulator developed by Brown (2006). Each of BP and RP is simulated with 48 pixels, whereby the dispersion varies from 30-290 Å/pix and 60-150 Å/pix respectively. We artificially reddened each spectrum with a standard interstellar extinction law with R=3.1, for regular values of Av from 0 to 10 for the regular library, and for 10 random values of Av uniformly distributed in log(1+Av) for the random library. Noise was added to all spectra, which includes the source Poisson noise, background Poisson noise and CCD readout noise. This is done for five different source G-band magnitudes (15, 17, 18, 19 and 20). For the following classification tests we use only the sample at G=18. In Fig. 8 we present the simulated BP and RP spectra for the eight typical synthetic spectra of galaxies.
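Drawing extinction values that are uniform in log(1+Av), as done for the random library, can be sketched as follows. The upper limit of 10 mag is an assumption taken from the range quoted for the regular library, and the function name and seed are illustrative.

```python
import math
import random


def draw_extinctions(n_values, av_max, rng):
    """Draw Av so that log(1 + Av) is uniform between 0 and log(1 + av_max)."""
    u_max = math.log(1.0 + av_max)
    return [math.exp(rng.uniform(0.0, u_max)) - 1.0 for _ in range(n_values)]


rng = random.Random(0)  # fixed seed for reproducibility
av_values = draw_extinctions(10, 10.0, rng)  # 10 values per spectrum, Av <= 10 assumed
```

Sampling in log(1+Av) rather than in Av places more of the draws at low extinction while still reaching the high end of the range.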



5. Classification & Parametrization 

In the present work we use classification Support Vector Machines (SVMs) (C-classification) to determine morphological types and regression SVMs (ε-regression) to estimate the various astrophysical parameters. We use the libsvm library of Chang & Lin (2001) implemented in the e1071 package in the R statistics package^4. A brief description of the SVMs is given in the Appendix of this paper. An accessible introduction to SVMs can be found in Bennett & Campbell (2000). For a more technical introduction, the tutorial by Burges (1998) is recommended.



Table 4. Galaxy classification with the SVM. The confusion matrix for the training set for galaxies at z=0. Columns indicate the true class, rows the predicted ones.

Type   E-S0    Sa     Sb    Sbc   Sc    Sd    Im
E-S0   1799    0      0     0     0     0     0
Sa     0       1366   0     0     0     0     0
Sb     0       0      53    5     0     0     0
Sbc    0       0      0     134   0     0     0
Sc     0       0      0     0     830   0     0
Sd     0       0      0     0     0     347   1
Im     0       0      0     0     0     0     311


Table 5. As Table 4 but for the test set.

Type   E-S0    Sa     Sb    Sbc   Sc    Sd    Im
E-S0   1798    0      0     0     0     0     0
Sa     0       1329   0     0     0     0     0
Sb     0       0      44    0     0     0     0
Sbc    0       0      4     137   0     0     0
Sc     0       0      0     1     797   0     0
Sd     0       0      0     0     0     394   6
Im     0       0      0     0     0     3     324



5.1. Galaxies at zero redshift

5.1.1. Classification of the morphological type

We now try to classify the set of Gaia-simulated galaxy spectra, at G=18 with zero redshift, into the seven Hubble types. This subset of the library includes characteristic noise and a wide range of interstellar extinction (from 0-10 mag in Av). It comprises 9691 spectra. This we divide at random into two subsets: 4846 for training the SVM classifiers and 4845 for evaluating their performance. As is recommended with many machine learning methods, we first normalized the data by scaling each input (pixel) to have zero mean and unit standard deviation.
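The per-pixel normalization can be sketched as below. This is a generic standardization written for illustration, not the code used for this work; the function name and the tiny three-pixel "spectra" are assumptions.

```python
import statistics


def standardize(spectra):
    """Scale every pixel (column) to zero mean and unit standard deviation.

    `spectra` is a list of equal-length lists, one per galaxy. A pixel with
    zero scatter would cause a division by zero and must be handled upstream.
    """
    n_pixels = len(spectra[0])
    means = [statistics.fmean(s[j] for s in spectra) for j in range(n_pixels)]
    stds = [statistics.pstdev([s[j] for s in spectra]) for j in range(n_pixels)]
    return [[(s[j] - means[j]) / stds[j] for j in range(n_pixels)]
            for s in spectra]


# Toy data: three spectra with three pixels each (illustrative numbers).
toy_spectra = [[1.0, 10.0, 5.0], [2.0, 20.0, 7.0], [3.0, 30.0, 12.0]]
scaled = standardize(toy_spectra)
```

After this step every pixel carries the same weight in a distance-based kernel, regardless of the flux level in that part of the spectrum.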

For the purpose of visualizing the data set only, we perform a Principal Components Analysis (PCA) on the set of 9691 96-dimensional Gaia spectra. The first three Principal Components describe 78.25%, 20.44% and 1.02% of the data variance respectively (i.e. 99.71% together)^5. In Fig. 9 we plot the data in projection onto the first three PCs. This diagram, plus the fact that the first three PCs explain almost all of the variance in the data, suggests that a good classification should be possible (the data have an intrinsic low dimensionality).

^4 http://www.r-project.org

The results of training and testing the SVM classifier on the full 96-pixel spectra are shown in Tables 4 and 5. We see that there are very few misclassifications: only 6 and 14 in the training and testing set, corresponding to an error of 0.12% and 0.29% respectively. While these results are very promising, it must be recalled that the way the library has been constructed avoids class overlap in the SDSS g-r, r-i colour space, which surely eases separation in the 96-dimensional BP/RP colour space.
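The quoted training error follows directly from the confusion matrix. The sketch below recomputes it from the Table 4 counts, assuming that the cells left blank in the table are zero; the variable and function names are illustrative.

```python
TYPES = ["E-S0", "Sa", "Sb", "Sbc", "Sc", "Sd", "Im"]

# Rows: predicted class, columns: true class (Table 4, blank cells taken as 0).
TRAIN_CONFUSION = [
    [1799, 0, 0, 0, 0, 0, 0],
    [0, 1366, 0, 0, 0, 0, 0],
    [0, 0, 53, 5, 0, 0, 0],
    [0, 0, 0, 134, 0, 0, 0],
    [0, 0, 0, 0, 830, 0, 0],
    [0, 0, 0, 0, 0, 347, 1],
    [0, 0, 0, 0, 0, 0, 311],
]


def count_errors(matrix):
    """Return (number misclassified, total number) for a confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return total - correct, total


n_wrong, n_total = count_errors(TRAIN_CONFUSION)
error_percent = 100.0 * n_wrong / n_total
```

The off-diagonal entries also show where the confusion occurs: only between adjacent types (Sb with Sbc, Sd with Im).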



^5 Note that, because each input dimension has already been normalized to have zero mean and unit variance, a considerable fraction of the total variance is already accounted for.
5.1.2. Regression of astrophysical parameters

In addition to simulating an output spectrum, PEGASE.2 also derives 18 output astrophysical parameters for each galaxy. Of course, by construction we know that our synthetic spectra are uniquely defined by five parameters (p1, p2, infall timescale, age of the galactic winds and the Hubble type), so there can only be five equivalent independent parameters amongst these 18. Nonetheless, it would be useful to predict them directly. Here we build SVM regression models to separately predict the nine most significant ones (listed in Table 6). For each model we train on a randomly selected set of 4846 spectra and evaluate performance on the remaining 4845. In Fig. 10 we present the true and the SVM-predicted values of each parameter on the test set. Table 6 summarizes this by giving the mean of the difference between the true and predicted values for each parameter (which measures the systematic error) as well as the RMS residual (which measures the total scatter). The plots and table indicate that we can predict the parameters to good accuracy and precision, i.e. the systematics are very small and the RMS error is a small fraction of the typical values.
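The two summary statistics described here, the mean residual and the RMS residual, can be computed as in this short sketch. The function name and the toy values are assumptions for illustration and are not results from this work.

```python
import math


def bias_and_rms(true_values, predicted_values):
    """Mean of (true - predicted) and the RMS of the same residuals."""
    residuals = [t - p for t, p in zip(true_values, predicted_values)]
    bias = sum(residuals) / len(residuals)
    rms = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return bias, rms


# Toy example with invented numbers.
bias, rms = bias_and_rms([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

A mean residual near zero with a non-zero RMS, as in this toy case, indicates scatter without a systematic offset.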


























Fig. 11. The 48 719 simulated Gaia galaxy spectra with nonzero 
redshift plotted as their projections onto the first three Principal 
Components. Black, green, blue, light blue, magenta, yellow and 
red denote galaxies of type E, Sa, Sb, Sbc, Sc, Sd and Im respec- 
tively. 



5.2. Galaxies with redshift

5.2.1. Regression of redshift and classification of morphological type

We now enlarge the subset of the library we used in the previous tests by adding the same galaxies at four nonzero values of redshift, specifically 0.05, 0.1, 0.15, 0.2. The library for z=0 includes 9691 galaxies as described above. For each nonzero redshift there are 9757, giving a total sample of 48 719 galaxies. (Recall that this includes each galaxy simulated at 11 regular values of Av.) We now build another morphological type classification model as done in section 5.1.1, now with 6719 galaxies in the training set and 42 000 galaxies in the testing set.
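To illustrate what placing a spectrum at nonzero redshift does to its sampled shape, the sketch below stretches the wavelength axis by (1+z) and linearly resamples onto the original grid. This is only an assumption-based toy: it is not how the redshifted library spectra were generated, it ignores any change in flux normalization, and the function names are hypothetical.

```python
import bisect


def interp_linear(x_new, x, y):
    """Linear interpolation of y(x) at x_new; None outside the range of x."""
    if x_new < x[0] or x_new > x[-1]:
        return None
    j = bisect.bisect_right(x, x_new)
    if j == len(x):
        return y[-1]
    i = j - 1
    t = (x_new - x[i]) / (x[j] - x[i])
    return y[i] + t * (y[j] - y[i])


def redshift_spectrum(wavelength, flux, z):
    """Shift rest-frame wavelengths by (1 + z), resample on the same grid."""
    shifted = [w * (1.0 + z) for w in wavelength]
    return [interp_linear(w, shifted, flux) for w in wavelength]


# Flat toy spectrum on a 10 Angstrom grid with one narrow feature at 5000 A.
wavelength = [2500.0 + 10.0 * i for i in range(801)]   # 2500-10500 Angstrom
flux = [1.0] * len(wavelength)
flux[250] = 2.0                                        # feature at 5000 A
observed = redshift_spectrum(wavelength, flux, 0.1)    # feature moves to 5500 A
```

The blue end of the observed grid is left undefined (None) because no rest-frame flux maps onto it, which is one reason a library must extend blueward of the instrument passband.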

We again applied a PCA to the data. This time the first three Principal Components describe 76.01%, 21.63% and 1.02% of the data variance respectively (i.e. 98.6% together), very similar to before. The corresponding PCA projection plot is Fig. 11. Comparing to Fig. 9 we can see how the redshift spreads out the previous loci of types. The performance of the SVM classifier is summarized in Tables 7 and 8. The performance is good considering the added complexity introduced by the redshift variations (and the corresponding increase in the sample size). The misclassification errors are 0.13% and 0.98%, corresponding to 9 and 411 galaxies for the training and the testing data respectively.

In practice we may want to first reduce spectra to the
rest frame, for which we require an estimate of the redshift.
Therefore, we also set up an SVM regression model to predict
redshift, using the same training and test sets. The predicted
values of redshift for each of the five true redshift values are
presented in Fig. 12. We do not expect very good performance here,
because the SVM has to learn the effect of redshift based
on just five different values.
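Reducing a spectrum to the rest frame, given a redshift estimate, amounts to dividing the observed wavelengths by (1 + z). A minimal sketch, assuming a spectrum stored as a plain list of wavelengths in nm; the function name and the sample values are illustrative, and flux rescaling and resampling onto the instrument pixel grid are deliberately left out:

```python
def to_rest_frame(observed_wavelengths_nm, z):
    """Shift observed wavelengths to the rest frame.

    Uses lambda_rest = lambda_obs / (1 + z).
    """
    if z <= -1.0:
        raise ValueError("redshift must be greater than -1")
    return [lam / (1.0 + z) for lam in observed_wavelengths_nm]


# Hypothetical observed wavelengths for a galaxy with estimated z = 0.1.
observed = [550.0, 660.0, 1100.0]
rest = to_rest_frame(observed, 0.1)
for lam_obs, lam_rest in zip(observed, rest):
    print(f"{lam_obs:7.1f} nm observed -> {lam_rest:7.1f} nm rest frame")
```

An error in the estimated redshift translates directly into a wavelength shift of the reduced spectrum, which is why the accuracy of the redshift regression matters for any later rest-frame classification.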

Table 7. Galaxy classification with the SVM. The confusion
matrix for the training set for galaxies at z=0.0, 0.05, 0.1, 0.15, 0.2.
Columns indicate the true class, rows the predicted ones.

Table 8. As Table 7 but for the test set.

6. Discussion and conclusion

We have used the PEGASE.2 galaxy evolution model and the
observational data from SDSS to create an extended grid of
synthetic galaxy spectra. Using these we have identified the relevant
astrophysical parameters and their relevant ranges which provide
realistic galaxy spectra of known morphological type. This was
done specifically by comparing the colours of our library spectra
with those synthesized from SDSS spectra. We found small de- 
viations between the two colour loci for redder galaxies - where 
the ellipticals are found - which might be due to the fact that 
SDSS spectra are obtained in a small aperture (fibre diameter) 
while PEGASE spectra are representative of the whole galaxy. 
We also see that the observed sample has a considerably larger 
spread in the colour-colour diagram than the library spectra, 
which probably has observational reasons (photometric errors) 
as well as theoretical ones (insufficient cosmic variance in the 
galaxy models). That is, it may partially reflect the complicated 
nature of galaxy formation and evolution, although the overall 
agreement between the two is good. 

To achieve a better agreement between the observational and 
synthesized libraries we will further investigate the influence of 
the various PEGASE.2 parameters, especially those that were 
kept constant for this release of the library. On the other hand, 
due to the narrow redshift range (z < 0.2) explored here, evolution
factors are minimized. At higher redshifts, synthetic spectra
will be computed by simultaneously applying cosmological
k-corrections and evolution e-corrections to z=0 templates.

Fig. 10. Galaxy parameter estimation performance. For each of the nine APs we plot the predicted vs. true AP values for the test
set. The red line indicates the line of perfect estimation. The summary errors are given in Table 6.

Table 6. Summary of the performance of the SVM regression models for predicting the nine APs listed. The sample is for zero
redshift but for interstellar extinction (A_v) varying from 0 to 10 mag. The second and third columns list the mean and RMS errors
respectively. The final column gives the number of support vectors in the SVM model.

Astrophysical Parameter                                   mean(real-predicted)/mean(real)  sd(real-predicted)/mean(real)   SVs
mass to light ratio (M/L)                                 -1.03e-2                         3.78e-2                          97
normalized star formation rate (SFR)                      -3.35e-3                         3.97e-2                        2285
metallicity of interstellar medium (Mim)                  -2.85e-3                         8.77e-2                         345
metallicity of stars averaged on mass (Msm)               -3.64e-4                         2.17e-2                        3544
normalized mass of gas (Mgas)                              4.52e-3                         4.29e-2                         190
normalized mass in stars (Ms)                              3.22e-4                         5.48e-2                        1639
mean age of stars averaged on bolometric luminosity (Al)   1.45e-3                         3.22e-2                        3566
normalized SNIa rate (SNIa)                                9.69e-4                         3.43e-2                         376
normalized SNII rate (SNII)                               -6.04e-4                         3.81e-2                        2247
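The two error columns of Table 6 are the mean and the standard deviation of the residuals (real minus predicted), each divided by the mean of the real values. A short sketch of that computation follows; the function name and the sample numbers are hypothetical, and the use of the sample (n-1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def normalized_errors(real, predicted):
    """Return (bias, scatter) as defined in the Table 6 column headers.

    bias    = mean(real - predicted) / mean(real)
    scatter = sd(real - predicted) / mean(real)
    """
    residuals = [r - p for r, p in zip(real, predicted)]
    mean_real = statistics.fmean(real)
    bias = statistics.fmean(residuals) / mean_real
    # Assumption: sample standard deviation (n - 1 in the denominator).
    scatter = statistics.stdev(residuals) / mean_real
    return bias, scatter


# Hypothetical true and predicted values of one astrophysical parameter.
real = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
bias, scatter = normalized_errors(real, predicted)
print(f"bias = {bias:.2e}, scatter = {scatter:.2e}")
```

Dividing by the mean of the real values makes the numbers comparable between parameters with very different units and ranges, but it is only meaningful when that mean is well away from zero.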

Among the existing libraries of observed spectra, the most
complete and homogeneous is the SDSS, since it covers a
significant part of the whole sky and it goes fainter than the
expected detection limit of Gaia. We therefore aim to produce a
suitable set of synthetic spectra covering as much as possible of
the SDSS colour range and we plan further comparisons in our
future work.

Adding phenomena such as galaxy mergers is a challenging
hypothesis, but we believe that at the low redshifts Gaia will
observe, this is not such an important or frequent mechanism of
galaxy evolution. On the other hand, starburst galaxies are more
frequent at small redshifts and we intend to enrich our library
with this type of galaxy.

[Figure 12: five panels titled "predicted z for galaxies with z=0", "z=0.05", "z=0.10", "z=0.15" and "z=0.20"; the horizontal axis of each panel is the predicted redshift.]

Fig. 12. Distribution of predicted values of redshift, shown separately for the five true values of redshift (z=0, 0.05, 0.1, 0.15 and 0.2).

First results of SVM for classification and parametrization of
the library are quite promising. In particular, the first indications 
are that Gaia will be able to produce a wealth of information for 
a large statistical sample of galaxies. After constructing a more 
complete library of spectra we will be able to perform more tests 
and construct a classifier able to treat more realistic and complete 
simulations of galaxy spectra. 

Appendix A: Support Vector Machines



Support Vector Machines (SVMs) (Vapnik 1995) are supervised
machine learning methods for data classification. In their basic 
form they achieve a linear classification between two classes by 
defining an optimal hyperplane which separates members of the 
two classes. If the classes are separable then there generally ex- 
ists an infinite number of hyperplanes which achieve this. The 
SVM optimal plane is defined as that plane which maximises 
the margin between the opposing class members nearest to the 
boundary. That is, unlike many other classifiers which use all of 
the data to define the boundary, SVMs take the (arguably more 
reasonable) approach of using just those points nearest to the 
boundary. It has been demonstrated that this gives rise to a more 
robust and more accurate classifier under general conditions. 
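To make the maximum-margin idea concrete, the sketch below compares two hyperplanes that both separate a small, linearly separable toy sample and measures the margin of each, i.e. the distance from the plane to the nearest point. The data, the function name and the two candidate planes are illustrative assumptions; a real SVM finds the optimal plane by solving an optimization problem rather than by comparing candidates.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance y_i * (w . x_i + b) / ||w|| over all points.

    A positive value means every point lies on the correct side.
    """
    norm = math.sqrt(sum(wk * wk for wk in w))
    return min(
        y * (sum(wk * xk for wk, xk in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


# Toy sample: class -1 at x <= 1, class +1 at x >= 3.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
labels = [-1, -1, 1, 1]

# Two separating planes: the vertical lines x = 1.5 and x = 2.0.
margin_a = geometric_margin((1.0, 0.0), -1.5, points, labels)
margin_b = geometric_margin((1.0, 0.0), -2.0, points, labels)
print(f"margin of x = 1.5: {margin_a}, margin of x = 2.0: {margin_b}")
```

Only the two points nearest the boundary (at x = 1 and x = 3) determine which plane is best here; moving the outer points further away would not change the answer.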

In most non-trivial problems, however, the classes are not
linearly separable. In these cases, just those points which lie on
the wrong side of the hyperplane - the so-called support vectors -
enter into the total classification error. By minimizing this error
- which also measures the distance of the support vectors from
the plane - we define the optimal separating plane, i.e. with the
fewest misclassifications (and preferentially of those which lie
closer to the plane).

In the general case, the classes are not even marginally lin- 
early separable (consider the XOR problem) so a linear classi- 
fier, no matter how optimal, is useless. SVMs address this issue 
by using kernels to project the data into a higher dimensional 
space. For example, with a polynomial kernel we take square, 
cubic etc. combinations of the original data to form additional 
dimensions and then apply the (linear) SVM classifier in this 
higher dimensional space. With many other kernels, however, 
this projection is only carried out implicitly. This approach can 
be thought of as nonlinearity by preprocessing, with the kernel 
overcoming the well known "curse of dimensionality". In the 
present work we use the radial basis kernel 



$$ K(x_i - x_j) = \exp(-\gamma \|x_i - x_j\|^2) \tag{A.1} $$



where $x_i$ and $x_j$ are two input vectors (e.g. spectra). The
classification of a new vector $x_j$ is then given by a function

$$ f(x_j) = \sum_{i=1}^{N} y_i \alpha_i K(x_i - x_j) \tag{A.2} $$

where $y_i \in \{-1, 1\}$ denotes the two classes, and a classification
is made by applying a threshold, e.g. $f(x_j) > 0.0$ implies class 1.
The $\alpha_i$ are the parameters of the model which are determined by
the model training ($i$ counts over the support vectors). SVMs
have a very important property, namely that the error function
is strictly convex, so it has a unique global solution which can
be found in polynomial time with standard optimizers (it is a
linearly constrained quadratic programming problem).

This is in marked contrast to neural networks, for example, in
which the optimizers converge on a local minimum and we can
only be guaranteed to find the global minimum via an exhaustive
search. Furthermore, with a sigmoidal kernel SVMs are equivalent
to neural networks but with the additional advantage that the
SVM automatically determines the neural network architecture
(number of weights).

The SVM model incorporates regularization via the specification
of a hyperparameter, C, which defines the width of a
margin around the separating hyperplane. The wider this margin
(larger C), the more data vectors which fall into it. These are all
considered support vectors and so all enter the error equation.
Thus with a larger C there is a higher penalty attached to errors,
i.e. less regularization.*

The other hyperparameter in the model is $\gamma$ (equation A.1).
Both $\gamma$ and C must be determined by the user. Prior information
may help, but in practice one carries out a rigorous search over
a two-dimensional grid to "tune" the SVM. We did this using
4-fold cross validation, iterating over grids of increasing density.

SVMs can also be used for regression. Instead of a hyperplane
and a margin about it, regression SVMs fit a line with a
tube of radius $\epsilon$ encompassing it. Data vectors which are less
than a distance $\epsilon$ from the line are considered to be correctly fit,
that is, the support vectors are only those points outside of the
tube. Thus the $\epsilon$ hyperparameter controls the degree of
regularization. The specific error function we use is the mean squared
error on the predictions, with the regularization again being
introduced via the constraints in the optimization (with Lagrangian
multipliers). All of the kernel and optimization machinery
applies equally to these models, so that nonlinear regression can
also be achieved.
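The tuning of the two hyperparameters by cross validation over a two-dimensional grid can be sketched as follows. This is only an outline under stated assumptions: the fold splitting is contiguous (so the data are assumed to be shuffled beforehand), `cv_error` is a hypothetical user-supplied callable that would train an SVM on the training indices and return its error on the held-out fold, and `toy_error` is a stand-in error surface used so that the example runs without any SVM library.

```python
import itertools
import math


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def grid_search(cv_error, c_grid, gamma_grid, n_samples, k=4):
    """Return (mean error, C, gamma) with the lowest k-fold mean error."""
    folds = k_fold_indices(n_samples, k)
    best = None
    for c, gamma in itertools.product(c_grid, gamma_grid):
        errors = []
        for i, test_idx in enumerate(folds):
            # Train on every fold except the i-th, test on the i-th.
            train_idx = [j for m, f in enumerate(folds) if m != i for j in f]
            errors.append(cv_error(c, gamma, train_idx, test_idx))
        mean_error = sum(errors) / k
        if best is None or mean_error < best[0]:
            best = (mean_error, c, gamma)
    return best


def toy_error(c, gamma, train_idx, test_idx):
    """Stand-in error surface with its minimum at C = 100, gamma = 0.01."""
    return (math.log10(c) - 2.0) ** 2 + (math.log10(gamma) + 2.0) ** 2


# Coarse logarithmic grid; a denser grid around the best point would follow.
coarse = [10.0 ** p for p in range(-3, 4)]
best = grid_search(toy_error, coarse, coarse, n_samples=6719)
print(best)
```

Iterating "over grids of increasing density" then means repeating the call with a finer grid centred on the best (C, gamma) pair found at the previous step.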



* C is actually the upper bound on $\alpha_i$, specifically $0 \le \alpha_i \le C$ and
$\sum_i \alpha_i y_i = 0$ (two of the constraints in the error minimization). Thus
a small C implies smaller $\alpha_i$ in equation A.2, which in turn implies
smoother functions, equivalent to more regularization.
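Equations (A.1) and (A.2), together with the two constraints on the coefficients mentioned in the footnote, can be written out directly. The sketch below is illustrative only: the support vectors, labels and coefficients are made-up numbers rather than the output of a trained model, and the decision function follows equation (A.2) as written, i.e. without an additive offset term.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """Radial basis kernel of equation (A.1): exp(-gamma * ||x_i - x_j||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


def decision_function(x_new, support_vectors, labels, alphas, gamma):
    """Equation (A.2): sum over support vectors of y_i * alpha_i * K."""
    return sum(
        y * a * rbf_kernel(sv, x_new, gamma)
        for sv, y, a in zip(support_vectors, labels, alphas)
    )


def satisfies_constraints(labels, alphas, c, tol=1e-12):
    """Check 0 <= alpha_i <= C and sum_i alpha_i * y_i = 0."""
    in_box = all(0.0 <= a <= c for a in alphas)
    balanced = abs(sum(a * y for a, y in zip(alphas, labels))) <= tol
    return in_box and balanced


# Hypothetical two-dimensional support vectors and coefficients.
support_vectors = [(0.0, 0.0), (2.0, 0.0)]
labels = [-1, 1]
alphas = [0.8, 0.8]
gamma = 0.5

assert satisfies_constraints(labels, alphas, c=1.0)
f_value = decision_function((1.8, 0.1), support_vectors, labels, alphas, gamma)
predicted_class = 1 if f_value > 0.0 else -1
print(f"f = {f_value:.4f}, predicted class = {predicted_class}")
```

Since every $\alpha_i$ is capped at C, lowering C limits how strongly any single support vector can pull the decision function, which is the sense in which a small C gives a smoother, more regularized model.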



